Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: don't load entire archived workflow into memory in list APIs #12912

Merged
merged 5 commits into from
Apr 16, 2024

Conversation

jiachengxu
Copy link
Member

@jiachengxu jiachengxu commented Apr 8, 2024

Fixes #11121 (comment), Related to #12025 (comment)

This PR fixes the regression introduced in #11121 (comment). In this PR, we don't load the entire workflow to the memory since that causes high memory usage and slow query performance.

@agilgur5 agilgur5 changed the title fix: load entire archived workflow into memory in list APIs fix: don't load entire archived workflow into memory in list APIs Apr 8, 2024
@agilgur5 agilgur5 self-assigned this Apr 8, 2024
Signed-off-by: Jiacheng Xu <[email protected]>
@jiachengxu jiachengxu force-pushed the select-archived-workflow-fix branch from 8dc8d1a to 1cd9117 Compare April 8, 2024 17:22
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 12, 2024
Copy link
Member

@jessesuen jessesuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@terrytangyuan terrytangyuan merged commit f80b9e8 into argoproj:main Apr 16, 2024
28 checks passed
StartedAt: v1.Time{Time: md.StartedAt},
FinishedAt: v1.Time{Time: md.FinishedAt},
Progress: wfv1.Progress(md.Progress),
},
Copy link
Member

@agilgur5 agilgur5 Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What did you base the metadata requirements on? Based on workflows-service.ts and reports.tsx, we may also want status.message, status.estimatedDuration, status.resourceDuration, and spec.suspend

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was checking columns in the workflow UI and decided what metadata to include, but if the fields that you mentioned are required, I can open another PR to add them

Copy link
Member

@agilgur5 agilgur5 Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically speaking, fields is a query parameter in the API, so ideally we'd dynamically set the fields.

If you didn't notice some of them it may be because you can expand items in the list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that could be, I will prepare a PR to add those fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be causing a regression now: #13098

Copy link
Member

@agilgur5 agilgur5 May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm looking at #11121 (comment) more, it may have also unintentionally introduced a feature where you can get all parts of an Archived Workflows in the list API, whereas you could not in 3.4 if I'm reading correctly.

I.e. fixing that bug also removed that feature. I don't necessarily mind that if it wasn't present in 3.4, but I can see how that could cause some back-and-forth turbulence & instability for users (which, well, #11121 caused a lot of regressions & instability unfortunately (and predates me))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR to add the above missing fields: #13136

agilgur5 pushed a commit that referenced this pull request Apr 19, 2024
@agilgur5
Copy link
Member

Backported cleanly into release-3.5 as 200f4d1

MasonM added a commit to MasonM/argo-workflows that referenced this pull request Oct 17, 2024
…j#13601

As explained in
argoproj#13601 (comment),
I believe argoproj#12912
introduced a performance regression when listing workflows for
PostgreSQL users. Reverting that PR could re-introduce the memory issues
mentioned in the PR description, so instead this mitigates the impact by
converting the `workflow` column to be of type `jsonb`.

Initially `workflow` was of type `text`, and was migrated to `json` in
argoproj#2152. I'm not sure why
`jsonb` wasn't chosen, but [based on this comment in the linked
issue](argoproj#2133 (comment)),
I think it was simply an oversight. Here's the relevant docs
(https://www.postgresql.org/docs/current/datatype-json.html):

> The json and jsonb data types accept almost identical sets of values as input. The major practical difference is one of efficiency. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.
>
> Because the json type stores an exact copy of the input text, it will preserve semantically-insignificant white space between tokens, as well as the order of keys within JSON objects. Also, if a JSON object within the value contains the same key more than once, all the key/value pairs are kept. (The processing functions consider the last value as the operative one.) By contrast, jsonb does not preserve white space, does not preserve the order of object keys, and does not keep duplicate object keys. If duplicate keys are specified in the input, only the last value is kept.
>
> In general, most applications should prefer to store JSON data as jsonb, unless there are quite specialized needs, such as legacy assumptions about ordering of object keys.

I'm pretty sure we don't care about key order or whitespace. We do care
somewhat about insertion speed, but archived workflows are read much
more frequently than written, so a slight reduction in write speed that
gives a large improvement in read speed is a good tradeoff.

Here's steps to test this:
1. Use argoproj#13715 to generate 100,000 randomized workflows, with https://gist.github.com/MasonM/52932ff6644c3c0ccea9e847780bfd90 as a template:

     ```
     $ time go run ./hack/db fake-archived-workflows  --template "@very-large-workflow.yaml" --rows 100000
     Using seed 1935828722624432788
     Clusters: [default]
     Namespaces: [argo]
     Inserted 100000 rows

     real    18m35.316s
     user    3m2.447s
     sys     0m44.972s
     ```
2. Run the benchmarks using argoproj#13767:
    ```
    make BenchmarkWorkflowArchive > postgres_before_10000_workflows.txt
    ```
3. Run the migration the DB CLI:
    ```
    $ time go run ./hack/db migrate
    INFO[0000] Migrating database schema                     clusterName=default dbType=postgres
    INFO[0000] applying database change                      change="alter table argo_archived_workflows alter column workflow set data type jsonb using workflow::jsonb" changeSchemaVersion=60
    2024/10/17 18:07:42     Session ID:     00001
            Query:          alter table argo_archived_workflows alter column workflow set data type jsonb using workflow::jsonb
            Stack:
                    fmt.(*pp).handleMethods@/usr/local/go/src/fmt/print.go:673
                    fmt.(*pp).printArg@/usr/local/go/src/fmt/print.go:756
                    fmt.(*pp).doPrint@/usr/local/go/src/fmt/print.go:1208
                    fmt.Append@/usr/local/go/src/fmt/print.go:289
                    log.(*Logger).Print.func1@/usr/local/go/src/log/log.go:261
                    log.(*Logger).output@/usr/local/go/src/log/log.go:238
                    log.(*Logger).Print@/usr/local/go/src/log/log.go:260
                    github.com/argoproj/argo-workflows/v3/persist/sqldb.ansiSQLChange.apply@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/ansi_sql_change.go:11
                    github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.applyChange.func1@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:295
                    github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.applyChange@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:284
                    github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.Exec@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:273
                    main.NewMigrateCommand.func1@/home/vscode/go/src/github.com/argoproj/argo-workflows/hack/db/main.go:50
                    github.com/spf13/cobra.(*Command).execute@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:985
                    github.com/spf13/cobra.(*Command).ExecuteC@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117
                    github.com/spf13/cobra.(*Command).Execute@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041
                    main.main@/home/vscode/go/src/github.com/argoproj/argo-workflows/hack/db/main.go:39
                    runtime.main@/usr/local/go/src/runtime/proc.go:272
                    runtime.goexit@/usr/local/go/src/runtime/asm_amd64.s:1700
            Rows affected:  0
            Error:          upper: slow query
            Time taken:     69.12755s
            Context:        context.Background

    real    1m10.087s
    user    0m1.541s
    sys     0m0.410s
    ```
2. Re-run the benchmarks:
    ```
    make BenchmarkWorkflowArchive > postgres_after_10000_workflows.txt
    ```
4. Compare results using [benchstat](https://pkg.go.dev/golang.org/x/perf/cmd/benchstat):
    ```
    $ benchstat postgres_before_10000_workflows3.txt postgres_after_10000_workflows2.txt
    goos: linux
    goarch: amd64
    pkg: github.com/argoproj/argo-workflows/v3/test/e2e
    cpu: 12th Gen Intel(R) Core(TM) i5-12400
                                                         │ postgres_before_10000_workflows3.txt │  postgres_after_10000_workflows2.txt  │
                                                         │                sec/op                │    sec/op     vs base                 │
    WorkflowArchive/ListWorkflows-12                                              183.83m ± ∞ ¹   24.69m ± ∞ ¹        ~ (p=1.000 n=1) ²
    WorkflowArchive/ListWorkflows_with_label_selector-12                          192.71m ± ∞ ¹   25.87m ± ∞ ¹        ~ (p=1.000 n=1) ²
    WorkflowArchive/CountWorkflows-12                                              13.04m ± ∞ ¹   11.75m ± ∞ ¹        ~ (p=1.000 n=1) ²
    geomean                                                                        77.31m         19.58m        -74.68%
    ¹ need >= 6 samples for confidence interval at level 0.95
    ² need >= 4 samples to detect a difference at alpha level 0.05

                                                         │ postgres_before_10000_workflows3.txt │  postgres_after_10000_workflows2.txt  │
                                                         │                 B/op                 │     B/op       vs base                │
    WorkflowArchive/ListWorkflows-12                                              497.2Ki ± ∞ ¹   497.5Ki ± ∞ ¹       ~ (p=1.000 n=1) ²
    WorkflowArchive/ListWorkflows_with_label_selector-12                          503.1Ki ± ∞ ¹   503.9Ki ± ∞ ¹       ~ (p=1.000 n=1) ²
    WorkflowArchive/CountWorkflows-12                                             8.972Ki ± ∞ ¹   8.899Ki ± ∞ ¹       ~ (p=1.000 n=1) ²
    geomean                                                                       130.9Ki         130.7Ki        -0.20%
    ¹ need >= 6 samples for confidence interval at level 0.95
    ² need >= 4 samples to detect a difference at alpha level 0.05

                                                         │ postgres_before_10000_workflows3.txt │ postgres_after_10000_workflows2.txt  │
                                                         │              allocs/op               │  allocs/op    vs base                │
    WorkflowArchive/ListWorkflows-12                                               8.373k ± ∞ ¹   8.370k ± ∞ ¹       ~ (p=1.000 n=1) ²
    WorkflowArchive/ListWorkflows_with_label_selector-12                           8.410k ± ∞ ¹   8.406k ± ∞ ¹       ~ (p=1.000 n=1) ²
    WorkflowArchive/CountWorkflows-12                                               212.0 ± ∞ ¹    212.0 ± ∞ ¹       ~ (p=1.000 n=1) ³
    geomean                                                                        2.462k         2.462k        -0.03%
    ¹ need >= 6 samples for confidence interval at level 0.95
    ² need >= 4 samples to detect a difference at alpha level 0.05
    ³ all samples are equal
    ```

Signed-off-by: Mason Malone <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3.5 ListWorkflows causes server to hang when there are lots of archived workflows
4 participants