
3.5: Improve Archived API/DB read performance #13295

Closed
4 tasks done
gyanprakash48 opened this issue Jul 2, 2024 · 38 comments · Fixed by #13566
Labels
area/api (Argo Server API), area/server, area/workflow-archive, P1 (High priority. All bugs with >=5 thumbs up that aren't P0, plus any other bugs deemed high priority), solution/suggested (A solution to the bug has been suggested. Someone needs to implement it.), solution/workaround (There's a workaround; might not be great, but exists), type/bug, type/regression (Regression from previous behavior, a specific type of bug)

Comments

@gyanprakash48

gyanprakash48 commented Jul 2, 2024


edited by agilgur5:


Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Currently, when a large number of workflows are running, the Workflows view of the Argo Workflows UI takes up to 12 seconds to load. I can't find any way to optimise it further. CPU and memory usage of the server pod are normal. I am using Postgres to archive the workflows, and the database's CPU and memory usage are normal as well; I don't see any spikes anywhere. So it looks like it's not resource contention but something else. What can be done to improve the load time?

Version

3.5.0, 3.5.6, 3.5.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

It's the UI load time (Workflows view)

Logs from the workflow controller

It's the UI load time (Workflows view)

Logs from in your workflow's wait container

It's the UI load time (Workflows view)
@agilgur5 agilgur5 added the area/api Argo Server API label Jul 2, 2024
@agilgur5 agilgur5 changed the title Improving Argo server UI performance Improving Argo API performance Jul 2, 2024
@gyanprakash48
Author

gyanprakash48 commented Jul 3, 2024

I tried to trace it, and it seems the query below is the culprit. Per Query Performance Insight (Azure Postgres Flexible Server), this query's mean execution time is 9.304s, which explains the overall API time of 12-13 seconds.

select
	name,
	namespace,
	uid,
	phase,
	startedat,
	finishedat,
	coalesce((workflow::json)->'metadata'->>'labels',
	'{}') as labels,
	coalesce((workflow::json)->'metadata'->>'annotations',
	'{}') as annotations,
	coalesce((workflow::json)->'status'->>'progress',
	'') as progress,
	coalesce((workflow::json)->'metadata'->>'creationTimestamp',
	'') as creationtimestamp,
	(workflow::json)->'spec'->>'suspend' as suspend,
	coalesce((workflow::json)->'status'->>'message',
	'') as message,
	coalesce((workflow::json)->'status'->>'estimatedDuration',
	'0') as estimatedduration,
	coalesce((workflow::json)->'status'->>'resourcesDuration',
	'{}') as resourcesduration
from
	"argo_archived_workflows"
where
	(("clustername" = $1
		and "namespace" = $2
		and "instanceid" = $3)
	and "namespace" = $4
	and not exists (
	select
		1
	from
		argo_archived_workflows_labels
	where
		clustername = argo_archived_workflows.clustername
		and uid = argo_archived_workflows.uid
		and name = 'workflows.argoproj.io/controller-instanceid'))
order by
	"startedat" desc
limit 50

@gyanprakash48
Author

gyanprakash48 commented Jul 3, 2024

These JSON functions are the culprit:

coalesce((workflow::json)->'metadata'->>'labels',
'{}') as labels,
coalesce((workflow::json)->'metadata'->>'annotations',
'{}') as annotations,
coalesce((workflow::json)->'status'->>'progress',
'') as progress,
coalesce((workflow::json)->'metadata'->>'creationTimestamp',
'') as creationtimestamp,
(workflow::json)->'spec'->>'suspend' as suspend,
coalesce((workflow::json)->'status'->>'message',
'') as message,
coalesce((workflow::json)->'status'->>'estimatedDuration',
'0') as estimatedduration,
coalesce((workflow::json)->'status'->>'resourcesDuration',
'{}') as resourcesduration

@driv

driv commented Jul 3, 2024

I've been testing the queries from the logs of 3.5.8 and I can see a huge difference between MySQL 5.7 and 8.

The queries' execution plans are different. On 8 it executes more than two orders of magnitude faster.

I'm attaching the queries' EXPLAIN output.

explain.57.md
explain.8.md

MySQL 8 is able to use the range access type.

@gyanprakash48
Author

gyanprakash48 commented Jul 3, 2024

Hmm, I am using Postgres Flexible Server, so not sure if that's comparable.
What about storing these fields as generated columns in the table and using those generated columns in the API query, something like below? I tested it on the same data where the current query takes 9 seconds; the revised query completes in milliseconds. Does it have any downside? The Get Workflows route is the most frequently used in the UI, and its slowness takes away all the charm of Argo Workflows (users always feel frustrated).

table DDL

CREATE TABLE argo_archived_workflows (
	uid varchar(128) NOT NULL,
	"name" varchar(256) NOT NULL,
	phase varchar(25) NOT NULL,
	"namespace" varchar(256) NOT NULL,
	workflow json NOT NULL,
	startedat timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL,
	finishedat timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL,
	clustername varchar(64) NOT NULL,
	instanceid varchar(64) NOT NULL,
	labels text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'labels') stored,
	annotations text GENERATED ALWAYS as ( (workflow::json)->'metadata'->>'annotations') stored,
	progress text GENERATED ALWAYS AS ((workflow::json)->'status'->>'progress') stored,
	creationtimestamp text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'creationTimestamp') stored,
	suspend text GENERATED ALWAYS AS ((workflow::json)->'spec'->>'suspend') stored,
	message text GENERATED ALWAYS AS ((workflow::json)->'status'->>'message') stored,
	estimatedduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'estimatedDuration') stored,
	resourcesduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'resourcesDuration') stored,  
	CONSTRAINT argo_archived_workflows_pkey PRIMARY KEY (clustername, uid)
);
CREATE INDEX argo_archived_workflows_i1 ON argo_archived_workflows USING btree (clustername, instanceid, namespace);
CREATE INDEX argo_archived_workflows_i2 ON argo_archived_workflows USING btree (clustername, instanceid, finishedat);
CREATE INDEX argo_archived_workflows_i3 ON argo_archived_workflows USING btree (clustername, instanceid, name);
CREATE INDEX argo_archived_workflows_i4 ON argo_archived_workflows USING btree (startedat);

Revised Query

select
	name,
	namespace,
	uid,
	phase,
	startedat,
	finishedat,
	coalesce(labels,'{}') as labels,
	coalesce(annotations,'{}') as annotations,
	coalesce(progress,'') as progress,
	coalesce(creationtimestamp,'') as creationtimestamp,
	suspend,
	coalesce(message,'') as message,
	coalesce(estimatedduration,'0') as estimatedduration,
	coalesce(resourcesduration,'{}') as resourcesduration
from
	argo_archived_workflows
where
	(("clustername" = $1
		and "namespace" = $2
		and "instanceid" = $3)
	and "namespace" = $4
	and not exists (
	select
		1
	from
		argo_archived_workflows_labels
	where
		clustername = 'default'
		and uid = argo_archived_workflows.uid
		and name = 'workflows.argoproj.io/controller-instanceid'))
order by startedat desc
limit 45;

@driv

driv commented Jul 3, 2024

Given that the query has a limit 50 and those fields are not used in filtering or sorting, I doubt they are responsible for the long query duration.

Can you share an EXPLAIN?

@gyanprakash48
Author

gyanprakash48 commented Jul 3, 2024

In the beginning I thought that too, but trying different strategies to switch to a different explain plan was not helping at all. Then I commented out those JSON function columns in the SELECT, and suddenly the query was super fast. Even though they are not part of sorting and filtering, these JSON blobs are very big (when workflows are large, at least that's the case in my situation).

@gyanprakash48
Author

gyanprakash48 commented Jul 3, 2024

log (attached)
Maybe this recording will help further. I am sure this is only one of the scenarios (there must be many others).

I ran both queries twice (just to rule out concerns like caching).

@agilgur5
Member

agilgur5 commented Jul 6, 2024

Then I commented out those JSON function columns in the SELECT, and suddenly the query was super fast. Even though they are not part of sorting and filtering, these JSON blobs are very big (when workflows are large, at least that's the case in my situation).

Yea, they can be very large, though I'm surprised it takes that long since it's not part of the filter. I'm wondering if the engine is extracting the JSON multiple times, once for each field?
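
For anyone who wants to reproduce this observation, a rough comparison (a sketch, assuming Postgres and the table/column names from the query above) is to EXPLAIN ANALYZE the list query with and without the JSON-extracting columns:

```sql
-- Baseline: no JSON extraction in the select list.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, namespace, uid, phase, startedat, finishedat
FROM argo_archived_workflows
ORDER BY startedat DESC
LIMIT 50;

-- Same query plus a few JSON-extracted fields; the runtime difference is
-- roughly the cost of parsing the workflow blob for each extracted field.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, namespace, uid, phase, startedat, finishedat,
       coalesce((workflow::json)->'metadata'->>'labels', '{}')      AS labels,
       coalesce((workflow::json)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::json)->'status'->>'message', '')         AS message
FROM argo_archived_workflows
ORDER BY startedat DESC
LIMIT 50;
```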

I tested it on the same data where the current query takes 9 seconds; the revised query completes in milliseconds. Does it have any downside?

	labels text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'labels') stored,
	annotations text GENERATED ALWAYS as ( (workflow::json)->'metadata'->>'annotations') stored,
	progress text GENERATED ALWAYS AS ((workflow::json)->'status'->>'progress') stored,
	creationtimestamp text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'creationTimestamp') stored,
	suspend text GENERATED ALWAYS AS ((workflow::json)->'spec'->>'suspend') stored,
	message text GENERATED ALWAYS AS ((workflow::json)->'status'->>'message') stored,
	estimatedduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'estimatedDuration') stored,
	resourcesduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'resourcesDuration') stored,

We don't currently filter on those, so they're a little unnecessary as individual columns, but otherwise that makes sense to me if it's compatible with MySQL as well. I'm wondering if we could just use a materialized view instead for a similar effect?
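
For illustration only, a sketch of the materialized-view idea (assuming Postgres; this is not the project's actual schema, and it would need an explicit refresh strategy since materialized views don't update automatically):

```sql
-- Hypothetical: pre-extract the listed fields so the list query never touches the blob.
CREATE MATERIALIZED VIEW argo_archived_workflows_list AS
SELECT clustername, uid, name, namespace, phase, startedat, finishedat,
       coalesce((workflow::json)->'metadata'->>'labels', '{}')          AS labels,
       coalesce((workflow::json)->'metadata'->>'annotations', '{}')     AS annotations,
       coalesce((workflow::json)->'status'->>'progress', '')            AS progress,
       coalesce((workflow::json)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::json)->'spec'->>'suspend'                             AS suspend,
       coalesce((workflow::json)->'status'->>'message', '')             AS message,
       coalesce((workflow::json)->'status'->>'estimatedDuration', '0')  AS estimatedduration,
       coalesce((workflow::json)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM argo_archived_workflows;

-- A unique index permits REFRESH MATERIALIZED VIEW CONCURRENTLY.
CREATE UNIQUE INDEX argo_archived_workflows_list_uid
    ON argo_archived_workflows_list (clustername, uid);
```

Note that MySQL has no native materialized views, so this would be Postgres-only; generated columns are the closer cross-database equivalent.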

cc @jiachengxu @terrytangyuan @jessesuen

@agilgur5 agilgur5 changed the title Improving Argo API performance Improving Argo API/DB performance Jul 6, 2024
@jlpgz

jlpgz commented Jul 8, 2024

We are experiencing the same problem using PostgreSQL. There are nearly 1 million archived workflows in the database, which are quite large, making the query extremely expensive to run. As a result, the UI becomes unusable due to its slow response time.

I have always wondered why the controller does not create a JSONB column to store the workflow information.

@agilgur5 agilgur5 added the P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important label Jul 9, 2024
@jlpgz

jlpgz commented Jul 9, 2024

Would it be possible to change the workflow column to JSONB in Postgres via migration? JSON operations would be faster and we could use GIN indexes.

@jiachengxu
Member

jiachengxu commented Jul 9, 2024

Would it be possible to change the workflow column to JSONB in Postgres via migration? JSON operations would be faster and we could use GIN indexes.

I think this is a good point; currently for Postgres we use json but not jsonb. In the archived workflow case (read-heavy), I think it makes sense to use JSONB since:

The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.

https://www.postgresql.org/docs/current/datatype-json.html

Yea they can be very large, though I'm surprised it takes that long, since it's not part of the filter. I'm wondering if the engine is extracting the JSON multiple times, once for each field?

@agilgur5 To your point, from the above quote ("The json data type stores an exact copy of the input text, which processing functions must reparse on each execution"), I think the current implementation is most likely reparsing the JSON for each field.

@agilgur5
Member

agilgur5 commented Jul 9, 2024

and we could use GIN indexes.

We don't need indexes on it currently as the JSON is not used in filtering.

I have always wondered why the controller does not create a JSONB column to store the workflow information.

Otherwise, it does sound like JSONB is the better choice for our read queries. This is only supported for Postgres though, and we'd have to add the min version of Postgres required for JSONB to the docs. I imagine lack of MySQL support and legacy code is the reason but 🤷

Is it straightforward to write a migration to JSONB?

We'd also still have to solve this for MySQL, so that alone wouldn't be enough. A materialized view or generated columns are probably needed, and compressed nodes (#13313) might help too.

I think for current implementation, most likely it is reparsing the json for each field.

That would explain why it may have gotten slower. It is odd to me that it's reparsed (the engine should be able to store it in memory and then reuse it for the query), but materialized views / generated columns should solve that relatively cleanly, assuming writes are not heavily slowed down (e.g. if it reparses each field during derivation/generation too).

@agilgur5 agilgur5 changed the title Improving Argo API/DB performance Improve Argo API/DB read performance Jul 9, 2024
@hmoulart

Hello,
I have the same behavior, with the database overloaded with archiveTTL: 180d.
Adding an index on some columns was not enough...
(screenshots attached)
The database is an AWS RDS PostgreSQL t4g.small.
archived_workflows: 816 757
argo_archived_workflows: 37 427
The number of rows is not that huge...

We created a job on the PostgreSQL database to keep 2 days in the active Argo table and 178 days in the archive table. The situation is now better.
But Argo hits the database very often for no reason with the query you can see in the screenshot (which is the same query as in this topic).
I'm available to help identify the issue and optimize Argo to avoid overloading databases.
Thanks for your support!

@jlpgz

jlpgz commented Jul 10, 2024

Is it straightforward to write a migration to JSONB?

I would say yes. My only concern is how long it might take to convert the workflow column for a large number of rows.

ALTER TABLE argo_archived_workflows
ALTER COLUMN workflow
SET DATA TYPE JSONB
USING workflow::JSONB;
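
To get a rough sense of how long that rewrite might take before committing to it (note the ALTER rewrites the whole table and holds an exclusive lock while it runs), one could time the cast on a sample first; a sketch:

```sql
-- Rough estimate: time the json -> jsonb conversion over a sample of rows,
-- then extrapolate to the full table size.
EXPLAIN ANALYZE
SELECT workflow::jsonb
FROM argo_archived_workflows
LIMIT 10000;
```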

@gyanprakash48
Author

Does changing the type to JSONB solve the performance issue completely? Has someone validated it?

@jlpgz

jlpgz commented Jul 31, 2024

But the query is still converting the workflow column to json, isn’t it?

@agilgur5
Member

agilgur5 commented Jul 31, 2024

Sorry for the confusion. The migration completed. So now the table is using jsonb for workflows.

After the migration, I reran the query (without the limit), and there was no change in performance

Ok thanks for clarifying. I imagine that without a limit any query is going to be slow. JSONB should still be faster than the JSON variant regardless though.

But the query is still converting the workflow column to json, isn’t it?

Is there a separate query for JSONB specifically? I actually don't know

My recommendation here is to just widen the table. It seems that, for whatever reason, postgres really doesn't like giant json blobs. If you splat out the queried fields, my assumption is it would be much faster.

Yea, the generated columns approach originally suggested by the OP and then by me above is more or less equivalent to that, and it was significantly faster (an order of magnitude or two) per the testing in the OP's comment.

@agilgur5 agilgur5 added P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority and removed P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important labels Sep 6, 2024
bom-d-van pushed a commit to bom-d-van/argo-workflows that referenced this issue Sep 6, 2024
The current implementation appears to cause Postgres to always unmarshal the workflow JSON payload for all the records in the table. By adopting a subquery approach, we are able to optimise the query from a runtime of 11495.734 ms to 44.713 ms. The data size is about 417481 argo_archived_workflows rows and 1794624 argo_archived_workflows_labels rows.

This new change is backward compatible and has been tested in our production env (using PostgreSQL).

Related issue: argoproj#13295

query change example:

previous:

```sql
SELECT name,
       namespace,
       UID,
       phase,
       startedat,
       finishedat,
       coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
       coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
       coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::JSON)->'spec'->>'suspend' AS suspend,
       coalesce((workflow::JSON)->'status'->>'message', '') AS message,
       coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
       coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE (("clustername" = 'default'
        AND "instanceid" = '')
       AND "namespace" = 'argo-map'
       AND EXISTS
         (SELECT 1
          FROM argo_archived_workflows_labels
          WHERE clustername = argo_archived_workflows.clustername
            AND UID = argo_archived_workflows.uid
            AND name = 'workflows.argoproj.io/phase'
            AND value = 'Succeeded')
       AND EXISTS
         (SELECT 1
          FROM argo_archived_workflows_labels
          WHERE clustername = argo_archived_workflows.clustername
            AND UID = argo_archived_workflows.uid
            AND name = 'workflows.argoproj.io/workflow-template'
            AND value = 'mapping1-pipeline-template-with-nfs'))
ORDER BY "startedat" DESC
LIMIT 1;
```

now:

```sql
SELECT name,
       namespace,
       UID,
       phase,
       startedat,
       finishedat,
       coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
       coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
       coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::JSON)->'spec'->>'suspend' AS suspend,
       coalesce((workflow::JSON)->'status'->>'message', '') AS message,
       coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
       coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE "clustername" = 'default'
  AND UID IN
    (SELECT UID
     FROM "argo_archived_workflows"
     WHERE (("clustername" = 'default'
             AND "instanceid" = '')
            AND "namespace" = 'argo-map'
            AND EXISTS
              (SELECT 1
               FROM argo_archived_workflows_labels
               WHERE clustername = argo_archived_workflows.clustername
                 AND UID = argo_archived_workflows.uid
                 AND name = 'workflows.argoproj.io/phase'
                 AND value = 'Succeeded')
            AND EXISTS
              (SELECT 1
               FROM argo_archived_workflows_labels
               WHERE clustername = argo_archived_workflows.clustername
                 AND UID = argo_archived_workflows.uid
                 AND name = 'workflows.argoproj.io/workflow-template'
                 AND value = 'mapping1-pipeline-template-with-nfs'))
     ORDER BY "startedat" DESC
     LIMIT 1);
```
@agilgur5
Member

agilgur5 commented Sep 7, 2024

So I've been thinking about this after discussing it in the Aug 6th Contributor Meeting with @sarabala1979. All of the options we discussed to support this had trade-offs, so there's no optimal solution.

I was thinking about the behavior of the Archived Workflows API prior to #11121, which takes me back to my own comment, #11121 (comment). The fix for that comment was #12912, which added the JSON_EXTRACT statements instead of retrieving the entirety of the JSON, which had caused both slow queries and OOMs on the Server. Now it's just a slow query at least 😅

We could remove those fields entirely from the SELECT, i.e. never return them from the Archived Workflows List API, and therefore never need to extract the JSON or add new columns. This would reflect the behavior of 3.4, but would make for some confusing inconsistencies in the unified list in 3.5+, as some items in the list would have the fields but the archived ones would not.
This seems not to strongly affect users though, afaict, because #13098 is similar and did not get any upvotes.
The main problem with this approach is that it limits the types of filtering that can be done on archived workflows (e.g. annotations are extracted from the JSON), and we expected to be able to implement more (more have always been requested) after the SQLiteDB addition in #12736 / #13021. Creating new columns for every piece of metadata users want to filter on is not necessarily sustainable either, though.
So I think, if necessary, we can roll back to that behavior and then see what we can do in the future in 3.6+.

But perhaps #13566 has the root-cause fix, if all the errors are worked out, and we won't have to think about any more options.

@ryancurrah
Contributor

ryancurrah commented Sep 9, 2024

We could remove those fields entirely from the SELECT, i.e. never return them from the Archived Workflows List API, and therefore never need to extract the JSON or add new columns. This would reflect the behavior of 3.4, but would make for some confusing inconsistencies in the unified list in 3.5+, as some items in the list would have the fields but the archived ones would not.

Not sure if I fully understand this comment. We search archived workflows by labels frequently because we archive workflows almost immediately. Would that mean we could not do that anymore?

@agilgur5
Member

agilgur5 commented Sep 9, 2024

I did forget a piece in my initial comment: labels actually have their own table already, so it would only be other forms of information in the JSON blob that would not be filterable, like annotations, creationTimestamp, message, pieces of the status, etc.; whatever doesn't have an existing column.
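
For context, label conditions in the list query already go through that separate table rather than the JSON blob, along these lines (adapted from the queries earlier in this thread; the namespace and label value are placeholders):

```sql
-- Label filtering uses argo_archived_workflows_labels, so it does not touch the workflow blob.
SELECT name, namespace, uid, phase, startedat, finishedat
FROM argo_archived_workflows
WHERE clustername = 'default'
  AND namespace = 'argo'            -- placeholder namespace
  AND EXISTS (
      SELECT 1
      FROM argo_archived_workflows_labels
      WHERE clustername = argo_archived_workflows.clustername
        AND uid = argo_archived_workflows.uid
        AND name = 'workflows.argoproj.io/phase'  -- label key, stored as its own row
        AND value = 'Succeeded')                  -- placeholder value
ORDER BY startedat DESC
LIMIT 50;
```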

@agilgur5
Member

agilgur5 commented Sep 10, 2024

But perhaps #13566 has the root cause fix if all the errors are worked out and we won't have to think about any more options

For other folks here who haven't been following the PR, I pushed up agilgur5/argo-cli:fix-archive-list-query-optimization for testing.

Please try it out and comment here with your query time improvement results (before/after) and DB type and version (e.g. MySQL vXXX)! Number of rows (e.g. 45000) and resources given (e.g. 2 CPU, 4GB memory) would also be helpful for proportional estimation.
Or report the lack of improvement if it didn't work for you.

@agilgur5
Member

agilgur5 commented Sep 10, 2024

From @Danny5487401 in #13563 (comment)

@agilgur5 three methods I tested:

  1. A nice implementation using a subquery, per fix(api): optimise archived list query. Fixes #13295 #13566
SELECT
    name, namespace, uid, phase, startedat, finishedat,
    coalesce(JSON_EXTRACT(workflow,'$.metadata.labels'), '{}') as labels,
    coalesce(JSON_EXTRACT(workflow,'$.metadata.annotations'), '{}') as annotations,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.progress')), '') as progress,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.metadata.creationTimestamp')), '') as creationtimestamp,
    JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.spec.suspend')) as suspend,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.message')), '') as message,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.estimatedDuration')), '0') as estimatedduration,
    coalesce(JSON_EXTRACT(workflow,'$.status.resourcesDuration'), '{}') as resourcesduration
FROM `argo_archived_workflows`
WHERE `clustername` = 'default' AND uid IN
( SELECT * FROM (SELECT uid
FROM `argo_archived_workflows`
WHERE ((`clustername` = 'default' AND `instanceid` = '') AND `namespace` = 'gllue-web'
    AND not exists
    ( select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/controller-instanceid')
    AND exists
    ( select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/phase' and value in ('Error', 'Failed', 'Pending', 'Running', 'Succeeded'))
          )
ORDER BY `startedat` DESC
    LIMIT 20)as x);

It drops from 37s 768ms to 1s 981ms.

  2. force index (argo_archived_workflows_i4)
SELECT
    name, namespace, uid, phase, startedat, finishedat,
    coalesce(JSON_EXTRACT(workflow,'$.metadata.labels'), '{}') as labels,
    coalesce(JSON_EXTRACT(workflow,'$.metadata.annotations'), '{}') as annotations,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.progress')), '') as progress,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.metadata.creationTimestamp')), '') as creationtimestamp,
    JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.spec.suspend')) as suspend,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.message')), '') as message,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.estimatedDuration')), '0') as estimatedduration,
    coalesce(JSON_EXTRACT(workflow,'$.status.resourcesDuration'), '{}') as resourcesduration
FROM `argo_archived_workflows` force index (argo_archived_workflows_i4)
WHERE ((`clustername` = 'default' AND `instanceid` = '') AND `namespace` = 'gllue-web' AND not exists
    ( select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/controller-instanceid')
           AND exists
    ( select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/phase' and value in ('Error', 'Failed', 'Pending', 'Running', 'Succeeded'))
    )
ORDER BY `startedat` DESC
LIMIT 20;

It drops from 37s 768ms to 138ms.

  3. A composite index on (clustername, startedat) to use a backward index scan instead of a filesort
CREATE INDEX argo_archived_workflows_i5 ON argo_archived_workflows (clustername, startedat);

It drops from 37s to 140ms.

It sounds like the force index and the new index are both faster than the subquery, at least on MySQL 8.0; I wonder if we should use those instead of the subquery.
force index is potentially not the most resilient, but the new index seems like a "why not?" to me.

@gyanprakash48
Author

gyanprakash48 commented Sep 11, 2024

For other folks here who haven't been following the PR, I pushed up agilgur5/argo-cli:fix-archive-list-query-optimization for testing.

@agilgur5, before I test, will there be any breaking changes (on the DB side)? Will I be able to roll back?

@agilgur5
Member

No breaking changes; that's what makes it a viable choice for a patch. It's just a change to the list query itself. You can roll back without issue. But of course I would try it in staging first as a regular precaution.

If you have DB access, you can also test by running the query yourself; it's no. 1 above for MySQL, and #13566's opening comment has the Postgres variant.

@ryancurrah
Contributor

@agilgur5 the image name is argo-cli; can I use it to replace argo-server and argo-executor?

@agilgur5
Member

No, you only need to replace the Server, since this query happens on the Server

@ryancurrah
Contributor

I can confirm this change reduces query times. Loading time in the UI when filtering by labels was reduced by 94.44%.

Query time before: 18 seconds
Query time after: 1 second
DB type: PostgreSQL
Version: 14
Number of rows (SELECT COUNT(*) FROM public.argo_archived_workflows;): 93135
CPUs: 24
Memory: 16GB

@agilgur5
Member

agilgur5 commented Sep 14, 2024

It sounds like the force index and the new index are both faster than the subquery, at least on MySQL 8.0; I wonder if we should use those instead of the subquery.
force index is potentially not the most resilient, but the new index seems like a "why not?" to me.

Added a new issue for further optimizations to this query in #13601

@gyanprakash48
Author

gyanprakash48 commented Oct 4, 2024

@agilgur5, sorry for the delay on my side in testing, but even with agilgur5/argo-cli:fix-archive-list-query-optimization I am not finding any improvement in UI performance. I am using PostgreSQL version 13.15.

@agilgur5
Member

agilgur5 commented Oct 4, 2024

sorry for the delay on my side in testing, but even with agilgur5/argo-cli:fix-archive-list-query-optimization

It's already released in 3.5.11.

I am not finding any improvement in UI performance

But you'd really have to detail why you're not seeing any improvement in API performance when other people are. Did you run the image for your Server properly? What kinds of queries are you using against the API and seeing in the DB? etc.

Per #13601 (comment), unpaginated queries can still be slow (indexing doesn't help that of course)

@gyanprakash48
Author

gyanprakash48 commented Oct 15, 2024

@agilgur5, yes, I applied the image (agilgur5/argo-cli:fix-archive-list-query-optimization) to the Server, and then for testing I simply used the UI. The UI was still loading very slowly (it's paginated, with the default limit of 50). The only difference in my use case might be that my workflows are very large (1000+ tasks), leading to very large JSON blobs.
