fix: don't load entire archived workflow into memory in list APIs #12912

jiachengxu · 2024-04-08T13:50:08Z

Fixes #11121 (comment), Related to #12025 (comment)

This PR fixes the regression introduced in #11121 (comment). With this change, the list APIs no longer load each entire archived workflow into memory, since doing so causes high memory usage and slow queries.

Signed-off-by: Jiacheng Xu <[email protected]>
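To illustrate the idea in the description, here is a minimal sketch of building list items from a few metadata columns rather than from the full serialized workflow. The type and function names (`archivedMetadata`, `listItem`, `toListItems`) are hypothetical stand-ins so that the example compiles on its own; they are not the project's actual types.

```go
package main

import (
	"fmt"
	"time"
)

// Progress is a stand-in for a "done/total" progress string.
type Progress string

// archivedMetadata is a hypothetical row type: only the columns a list view
// needs, selected instead of the full serialized workflow.
type archivedMetadata struct {
	Name       string
	Namespace  string
	Phase      string
	StartedAt  time.Time
	FinishedAt time.Time
	Progress   string
}

// listItem is a hypothetical trimmed-down workflow returned by a list API.
type listItem struct {
	Name       string
	Namespace  string
	Phase      string
	StartedAt  time.Time
	FinishedAt time.Time
	Progress   Progress
}

// toListItems converts metadata rows into list items. No workflow JSON is
// unmarshalled here, so memory use scales with the metadata size only.
func toListItems(rows []archivedMetadata) []listItem {
	items := make([]listItem, 0, len(rows))
	for _, md := range rows {
		items = append(items, listItem{
			Name:       md.Name,
			Namespace:  md.Namespace,
			Phase:      md.Phase,
			StartedAt:  md.StartedAt,
			FinishedAt: md.FinishedAt,
			Progress:   Progress(md.Progress),
		})
	}
	return items
}

func main() {
	start := time.Date(2024, 4, 8, 13, 50, 8, 0, time.UTC)
	rows := []archivedMetadata{{
		Name:       "wf-1",
		Namespace:  "argo",
		Phase:      "Succeeded",
		StartedAt:  start,
		FinishedAt: start.Add(90 * time.Second),
		Progress:   "3/3",
	}}
	for _, it := range toListItems(rows) {
		fmt.Println(it.Name, it.Phase, it.Progress, it.FinishedAt.Sub(it.StartedAt))
	}
}
```

The trade-off is that anything not selected as metadata (for example node statuses) is simply absent from the list response.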

jessesuen

LGTM!

agilgur5 · 2024-04-19T20:14:30Z

persist/sqldb/workflow_archive.go

+				StartedAt:  v1.Time{Time: md.StartedAt},
+				FinishedAt: v1.Time{Time: md.FinishedAt},
+				Progress:   wfv1.Progress(md.Progress),
+			},


What did you base the metadata requirements on? Based on `workflows-service.ts` and `reports.tsx`, we may also want `status.message`, `status.estimatedDuration`, `status.resourceDuration`, and `spec.suspend`.

I checked the columns in the workflow UI to decide which metadata to include. If the fields you mentioned are required, I can open another PR to add them.

Technically speaking, `fields` is a query parameter in the API, so ideally we'd dynamically set the fields.

If you didn't notice some of them it may be because you can expand items in the list?

Oh, that could be, I will prepare a PR to add those fields.

This seems to be causing a regression now: #13098

Hmm, looking at #11121 (comment) more, it may have also unintentionally introduced a feature where you can get all parts of an Archived Workflow in the list API, whereas you could not in 3.4, if I'm reading correctly.

I.e. fixing that bug also removed that feature. I don't necessarily mind that if it wasn't present in 3.4, but I can see how that could cause some back-and-forth turbulence and instability for users. Unfortunately, #11121 caused a lot of regressions and instability (and predates me).

PR to add the above missing fields: #13136

…2912) Signed-off-by: Jiacheng Xu <[email protected]> (cherry picked from commit f80b9e8)

agilgur5 · 2024-04-19T20:15:26Z

Backported cleanly into release-3.5 as 200f4d1

Signed-off-by: Mason Malone <[email protected]>

…j#13601

As explained in argoproj#13601 (comment), I believe argoproj#12912 introduced a performance regression when listing workflows for PostgreSQL users. Reverting that PR could re-introduce the memory issues mentioned in the PR description, so instead this mitigates the impact by converting the `workflow` column to be of type `jsonb`.

Initially `workflow` was of type `text`, and was migrated to `json` in argoproj#2152. I'm not sure why `jsonb` wasn't chosen, but [based on this comment in the linked issue](argoproj#2133 (comment)), I think it was simply an oversight.

Here are the relevant docs (https://www.postgresql.org/docs/current/datatype-json.html):

> The json and jsonb data types accept almost identical sets of values as input. The major practical difference is one of efficiency. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.
>
> Because the json type stores an exact copy of the input text, it will preserve semantically-insignificant white space between tokens, as well as the order of keys within JSON objects. Also, if a JSON object within the value contains the same key more than once, all the key/value pairs are kept. (The processing functions consider the last value as the operative one.) By contrast, jsonb does not preserve white space, does not preserve the order of object keys, and does not keep duplicate object keys. If duplicate keys are specified in the input, only the last value is kept.
>
> In general, most applications should prefer to store JSON data as jsonb, unless there are quite specialized needs, such as legacy assumptions about ordering of object keys.

I'm pretty sure we don't care about key order or whitespace. We do care somewhat about insertion speed, but archived workflows are read much more frequently than written, so a slight reduction in write speed that gives a large improvement in read speed is a good tradeoff.

Here are the steps to test this:

1. Use argoproj#13715 to generate 100,000 randomized workflows, with https://gist.github.com/MasonM/52932ff6644c3c0ccea9e847780bfd90 as a template:
   ```
   $ time go run ./hack/db fake-archived-workflows --template "@very-large-workflow.yaml" --rows 100000
   Using seed 1935828722624432788
   Clusters: [default]
   Namespaces: [argo]
   Inserted 100000 rows

   real    18m35.316s
   user    3m2.447s
   sys     0m44.972s
   ```
2. Run the benchmarks using argoproj#13767:
   ```
   make BenchmarkWorkflowArchive > postgres_before_10000_workflows.txt
   ```
3. Run the migration using the DB CLI:
   ```
   $ time go run ./hack/db migrate
   INFO[0000] Migrating database schema clusterName=default dbType=postgres
   INFO[0000] applying database change change="alter table argo_archived_workflows alter column workflow set data type jsonb using workflow::jsonb" changeSchemaVersion=60
   2024/10/17 18:07:42
       Session ID: 00001
       Query: alter table argo_archived_workflows alter column workflow set data type jsonb using workflow::jsonb
       Stack:
           fmt.(*pp).handleMethods@/usr/local/go/src/fmt/print.go:673
           fmt.(*pp).printArg@/usr/local/go/src/fmt/print.go:756
           fmt.(*pp).doPrint@/usr/local/go/src/fmt/print.go:1208
           fmt.Append@/usr/local/go/src/fmt/print.go:289
           log.(*Logger).Print.func1@/usr/local/go/src/log/log.go:261
           log.(*Logger).output@/usr/local/go/src/log/log.go:238
           log.(*Logger).Print@/usr/local/go/src/log/log.go:260
           github.com/argoproj/argo-workflows/v3/persist/sqldb.ansiSQLChange.apply@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/ansi_sql_change.go:11
           github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.applyChange.func1@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:295
           github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.applyChange@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:284
           github.com/argoproj/argo-workflows/v3/persist/sqldb.migrate.Exec@/home/vscode/go/src/github.com/argoproj/argo-workflows/persist/sqldb/migrate.go:273
           main.NewMigrateCommand.func1@/home/vscode/go/src/github.com/argoproj/argo-workflows/hack/db/main.go:50
           github.com/spf13/cobra.(*Command).execute@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:985
           github.com/spf13/cobra.(*Command).ExecuteC@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117
           github.com/spf13/cobra.(*Command).Execute@/home/vscode/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041
           main.main@/home/vscode/go/src/github.com/argoproj/argo-workflows/hack/db/main.go:39
           runtime.main@/usr/local/go/src/runtime/proc.go:272
           runtime.goexit@/usr/local/go/src/runtime/asm_amd64.s:1700
       Rows affected: 0
       Error: upper: slow query
       Time taken: 69.12755s
       Context: context.Background

   real    1m10.087s
   user    0m1.541s
   sys     0m0.410s
   ```
4. Re-run the benchmarks:
   ```
   make BenchmarkWorkflowArchive > postgres_after_10000_workflows.txt
   ```
5. Compare results using [benchstat](https://pkg.go.dev/golang.org/x/perf/cmd/benchstat):
   ```
   $ benchstat postgres_before_10000_workflows3.txt postgres_after_10000_workflows2.txt
   goos: linux
   goarch: amd64
   pkg: github.com/argoproj/argo-workflows/v3/test/e2e
   cpu: 12th Gen Intel(R) Core(TM) i5-12400
                                                        │ postgres_before_10000_workflows3.txt │  postgres_after_10000_workflows2.txt  │
                                                        │                sec/op                │    sec/op      vs base                │
   WorkflowArchive/ListWorkflows-12                                              183.83m ± ∞ ¹    24.69m ± ∞ ¹        ~ (p=1.000 n=1) ²
   WorkflowArchive/ListWorkflows_with_label_selector-12                          192.71m ± ∞ ¹    25.87m ± ∞ ¹        ~ (p=1.000 n=1) ²
   WorkflowArchive/CountWorkflows-12                                              13.04m ± ∞ ¹    11.75m ± ∞ ¹        ~ (p=1.000 n=1) ²
   geomean                                                                        77.31m          19.58m        -74.68%
   ¹ need >= 6 samples for confidence interval at level 0.95
   ² need >= 4 samples to detect a difference at alpha level 0.05

                                                        │ postgres_before_10000_workflows3.txt │  postgres_after_10000_workflows2.txt  │
                                                        │                 B/op                 │     B/op       vs base                │
   WorkflowArchive/ListWorkflows-12                                              497.2Ki ± ∞ ¹   497.5Ki ± ∞ ¹        ~ (p=1.000 n=1) ²
   WorkflowArchive/ListWorkflows_with_label_selector-12                          503.1Ki ± ∞ ¹   503.9Ki ± ∞ ¹        ~ (p=1.000 n=1) ²
   WorkflowArchive/CountWorkflows-12                                             8.972Ki ± ∞ ¹   8.899Ki ± ∞ ¹        ~ (p=1.000 n=1) ²
   geomean                                                                       130.9Ki         130.7Ki         -0.20%
   ¹ need >= 6 samples for confidence interval at level 0.95
   ² need >= 4 samples to detect a difference at alpha level 0.05

                                                        │ postgres_before_10000_workflows3.txt │  postgres_after_10000_workflows2.txt  │
                                                        │              allocs/op               │   allocs/op    vs base                │
   WorkflowArchive/ListWorkflows-12                                               8.373k ± ∞ ¹    8.370k ± ∞ ¹        ~ (p=1.000 n=1) ²
   WorkflowArchive/ListWorkflows_with_label_selector-12                           8.410k ± ∞ ¹    8.406k ± ∞ ¹        ~ (p=1.000 n=1) ²
   WorkflowArchive/CountWorkflows-12                                               212.0 ± ∞ ¹     212.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
   geomean                                                                        2.462k          2.462k          -0.03%
   ¹ need >= 6 samples for confidence interval at level 0.95
   ² need >= 4 samples to detect a difference at alpha level 0.05
   ³ all samples are equal
   ```

Signed-off-by: Mason Malone <[email protected]>

fix: load entire archived workflow into memory in list APIs

fd5f9da

Signed-off-by: Jiacheng Xu <[email protected]>

jiachengxu mentioned this pull request Apr 8, 2024

3.5 ListWorkflows causes server to hang when there are lots of archived workflows #12025

Closed

3 tasks

agilgur5 changed the title ~~fix: load entire archived workflow into memory in list APIs~~ fix: don't load entire archived workflow into memory in list APIs Apr 8, 2024

agilgur5 self-assigned this Apr 8, 2024

agilgur5 added area/api Argo Server API area/workflow-archive labels Apr 8, 2024

jiachengxu mentioned this pull request Apr 8, 2024

REQUEST: New membership for jiachengxu argoproj/argoproj#294

Closed

6 tasks

fix: add support for mysql

1cd9117

Signed-off-by: Jiacheng Xu <[email protected]>

jiachengxu force-pushed the select-archived-workflow-fix branch from 8dc8d1a to 1cd9117 Compare April 8, 2024 17:22

jiachengxu added 2 commits April 9, 2024 19:17

fix: handle null value from archived store

40d8cec

Signed-off-by: Jiacheng Xu <[email protected]>

Merge branch 'main' into select-archived-workflow-fix

f52c26b

agilgur5 mentioned this pull request Apr 12, 2024

feat: Unified workflows list UI and API #11121

Merged

Merge branch 'main' into select-archived-workflow-fix

4b380cf

agilgur5 added this to the v3.5.x patches milestone Apr 12, 2024

jessesuen approved these changes Apr 16, 2024

terrytangyuan approved these changes Apr 16, 2024

terrytangyuan merged commit f80b9e8 into argoproj:main Apr 16, 2024
28 checks passed

agilgur5 linked an issue Apr 17, 2024 that may be closed by this pull request

3.5 ListWorkflows causes server to hang when there are lots of archived workflows #12025

Closed

3 tasks

agilgur5 reviewed Apr 19, 2024

agilgur5 pushed a commit that referenced this pull request Apr 19, 2024

fix: don't load entire archived workflow into memory in list APIs (#1…

200f4d1

…2912) Signed-off-by: Jiacheng Xu <[email protected]> (cherry picked from commit f80b9e8)

This was referenced May 27, 2024

3.5.6+ items.status.nodes disappeared from /api/v1/workflows endpoint for completed Workflows #13098

Open

3.5 Pagination in workflow list page not working #13090

Closed

jiachengxu deleted the select-archived-workflow-fix branch June 14, 2024 16:45

spaced mentioned this pull request Jun 17, 2024

3.5.6: Archive workflow sql syntax error with mariadb #13202

Closed

4 tasks

agilgur5 mentioned this pull request Jun 17, 2024

fix(server): switch to JSON_EXTRACT and JSON_UNQUOTE for MySQL/MariaDB. Fixes #13202 #13203

Merged
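The follow-up fix named above switches MySQL/MariaDB to `JSON_EXTRACT` and `JSON_UNQUOTE`. As a hedged illustration of why the SQL must differ per database, the sketch below builds a text-extraction expression for a nested JSON key in each dialect. The function name `jsonTextSelector` and the `"postgres"`/`"mysql"` type strings are assumptions for this example, not the project's actual helper.

```go
package main

import (
	"fmt"
	"strings"
)

// jsonTextSelector returns a SQL expression that extracts a nested JSON
// value as text from the given column. The path segments are interpolated
// without escaping, so they must be trusted constants, never user input.
func jsonTextSelector(dbType, column string, path ...string) (string, error) {
	if len(path) == 0 {
		return "", fmt.Errorf("path must not be empty")
	}
	switch dbType {
	case "postgres":
		// PostgreSQL: -> returns JSON, ->> returns text (last hop only).
		var b strings.Builder
		b.WriteString(column)
		for i, p := range path {
			op := "->"
			if i == len(path)-1 {
				op = "->>"
			}
			b.WriteString(op + "'" + p + "'")
		}
		return b.String(), nil
	case "mysql":
		// JSON_UNQUOTE(JSON_EXTRACT(...)) works on both MySQL and MariaDB,
		// whereas the ->> shorthand is not available on MariaDB.
		return fmt.Sprintf("JSON_UNQUOTE(JSON_EXTRACT(%s, '$.%s'))", column, strings.Join(path, ".")), nil
	default:
		return "", fmt.Errorf("unsupported database type %q", dbType)
	}
}

func main() {
	for _, db := range []string{"postgres", "mysql"} {
		expr, err := jsonTextSelector(db, "workflow", "metadata", "name")
		if err != nil {
			panic(err)
		}
		fmt.Println(db+":", expr)
	}
}
```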

agilgur5 mentioned this pull request Jul 25, 2024

fix: improve get archived workflow query performance during controller estimation. Fixes #13382 #13394

Merged

agilgur5 mentioned this pull request Sep 7, 2024

3.5: Improve Archived API/DB read performance #13295

Closed

4 tasks

agilgur5 mentioned this pull request Sep 20, 2024

Optimize the content of the list of archived workflows sent to front-end #12030

Closed

MasonM added a commit to MasonM/argo-workflows that referenced this pull request Oct 17, 2024

fix: revert argoproj#12912

cd394be

Signed-off-by: Mason Malone <[email protected]>

MasonM mentioned this pull request Oct 17, 2024

3.5: Further optimize Archive List API call / DB query #13601

Open

4 tasks

MasonM mentioned this pull request Oct 17, 2024

fix!: migrate argo_archived_workflows.workflow to jsonb #13779

Open

jiachengxu commented Apr 8, 2024 (edited by agilgur5)

jessesuen left a comment

agilgur5 Apr 19, 2024 (edited)

jiachengxu Apr 19, 2024

agilgur5 Apr 26, 2024 (edited)

jiachengxu Apr 29, 2024

agilgur5 May 27, 2024

agilgur5 May 27, 2024 (edited)

jiachengxu Jun 2, 2024

agilgur5 commented Apr 19, 2024
