
3.5: Improve Archived API/DB read performance #13295

Closed
4 tasks done
gyanprakash48 opened this issue Jul 2, 2024 · 38 comments · Fixed by #13566
Labels
area/api (Argo Server API), area/server, area/workflow-archive, P1 (High priority. All bugs with >=5 thumbs up that aren't P0, plus any other bugs deemed high priority), solution/suggested (A solution to the bug has been suggested. Someone needs to implement it.), solution/workaround (There's a workaround; might not be great, but exists), type/bug, type/regression (Regression from previous behavior, a specific type of bug)

Comments

@gyanprakash48

gyanprakash48 commented Jul 2, 2024


edited by agilgur5:


Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Currently, when a large number of workflows are running, the Workflows view of the Argo Workflows UI takes up to 12 seconds to load. I can't find any way to optimise it further. CPU and memory usage of the server pod are normal. I am using Postgres to archive the workflows, and the database's CPU and memory usage are normal as well; I don't see any spikes anywhere. So it looks like it's not resource contention but something else. What can be done to improve the load time?

Version

3.5.0, 3.5.6, 3.5.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

It's the UI load time (Workflows view)

Logs from the workflow controller

It's the UI load time (Workflows view)

Logs from in your workflow's wait container

It's the UI load time (Workflows view)
@agilgur5 agilgur5 added the area/api Argo Server API label Jul 2, 2024
@agilgur5 agilgur5 changed the title Improving Argo server UI performance Improving Argo API performance Jul 2, 2024
@gyanprakash48
Author

gyanprakash48 commented Jul 3, 2024

I tried to trace it, and it seems the query below is the culprit. Per Query Performance Insight (Azure Postgres Flexible Server), this query's mean execution time is 9.304s, which explains the overall API time of 12-13 seconds.

select
	name,
	namespace,
	uid,
	phase,
	startedat,
	finishedat,
	coalesce((workflow::json)->'metadata'->>'labels',
	'{}') as labels,
	coalesce((workflow::json)->'metadata'->>'annotations',
	'{}') as annotations,
	coalesce((workflow::json)->'status'->>'progress',
	'') as progress,
	coalesce((workflow::json)->'metadata'->>'creationTimestamp',
	'') as creationtimestamp,
	(workflow::json)->'spec'->>'suspend' as suspend,
	coalesce((workflow::json)->'status'->>'message',
	'') as message,
	coalesce((workflow::json)->'status'->>'estimatedDuration',
	'0') as estimatedduration,
	coalesce((workflow::json)->'status'->>'resourcesDuration',
	'{}') as resourcesduration
from
	"argo_archived_workflows"
where
	(("clustername" = $1
		and "namespace" = $2
		and "instanceid" = $3)
	and "namespace" = $4
	and not exists (
	select
		1
	from
		argo_archived_workflows_labels
	where
		clustername = argo_archived_workflows.clustername
		and uid = argo_archived_workflows.uid
		and name = 'workflows.argoproj.io/controller-instanceid'))
order by
	"startedat" desc
limit 50

@gyanprakash48
Author

gyanprakash48 commented Jul 3, 2024

These JSON functions are the culprit:

coalesce((workflow::json)->'metadata'->>'labels',
'{}') as labels,
coalesce((workflow::json)->'metadata'->>'annotations',
'{}') as annotations,
coalesce((workflow::json)->'status'->>'progress',
'') as progress,
coalesce((workflow::json)->'metadata'->>'creationTimestamp',
'') as creationtimestamp,
(workflow::json)->'spec'->>'suspend' as suspend,
coalesce((workflow::json)->'status'->>'message',
'') as message,
coalesce((workflow::json)->'status'->>'estimatedDuration',
'0') as estimatedduration,
coalesce((workflow::json)->'status'->>'resourcesDuration',
'{}') as resourcesduration

@driv

driv commented Jul 3, 2024

I've been testing the queries from the logs of 3.5.8 and I can see a huge difference between MySQL 5.7 and 8.

The queries' execution plans are different. On 8 it executes more than two orders of magnitude faster.

I'm attaching the queries' EXPLAIN output.

explain.57.md
explain.8.md

MySQL 8 is able to use the range access type.

@gyanprakash48
Author

gyanprakash48 commented Jul 3, 2024

Hmm, I am using Postgres Flexible Server, so not sure if that's comparable.
What about storing these fields as generated columns in the table and using those generated columns in the API query, something like below? I tested it on the same data where the current query takes 9 seconds; the revised query completes in milliseconds. Does it have any downside? The Get Workflows route is the most frequently used in the UI, and its slowness takes away all the charm of Argo Workflows (users always feel frustrated).

table DDL

CREATE TABLE argo_archived_workflows (
	uid varchar(128) NOT NULL,
	"name" varchar(256) NOT NULL,
	phase varchar(25) NOT NULL,
	"namespace" varchar(256) NOT NULL,
	workflow json NOT NULL,
	startedat timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL,
	finishedat timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL,
	clustername varchar(64) NOT NULL,
	instanceid varchar(64) NOT NULL,
	labels text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'labels') stored,
	annotations text GENERATED ALWAYS as ( (workflow::json)->'metadata'->>'annotations') stored,
	progress text GENERATED ALWAYS AS ((workflow::json)->'status'->>'progress') stored,
	creationtimestamp text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'creationTimestamp') stored,
	suspend text GENERATED ALWAYS AS ((workflow::json)->'spec'->>'suspend') stored,
	message text GENERATED ALWAYS AS ((workflow::json)->'status'->>'message') stored,
	estimatedduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'estimatedDuration') stored,
	resourcesduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'resourcesDuration') stored,  
	CONSTRAINT argo_archived_workflows_pkey PRIMARY KEY (clustername, uid)
);
CREATE INDEX argo_archived_workflows_i1 ON argo_archived_workflows USING btree (clustername, instanceid, namespace);
CREATE INDEX argo_archived_workflows_i2 ON argo_archived_workflows USING btree (clustername, instanceid, finishedat);
CREATE INDEX argo_archived_workflows_i3 ON argo_archived_workflows USING btree (clustername, instanceid, name);
CREATE INDEX argo_archived_workflows_i4 ON argo_archived_workflows USING btree (startedat);

Revised Query

select
	name,
	namespace,
	uid,
	phase,
	startedat,
	finishedat,
	coalesce(labels,'{}') as labels,
	coalesce(annotations,'{}') as annotations,
	coalesce(progress,'') as progress,
	coalesce(creationtimestamp,'') as creationtimestamp,
	suspend,
	coalesce(message,'') as message,
	coalesce(estimatedduration,'0') as estimatedduration,
	coalesce(resourcesduration,'{}') as resourcesduration
from
	argo_archived_workflows
where
	(("clustername" = $1
		and "namespace" = $2
		and "instanceid" = $3)
	and "namespace" = $4
	and not exists (
	select
		1
	from
		argo_archived_workflows_labels
	where
		clustername = 'default'
		and uid = argo_archived_workflows.uid
		and name = 'workflows.argoproj.io/controller-instanceid'))
order by startedat desc
limit 45;

@driv

driv commented Jul 3, 2024

Given that the query has a limit 50 and those fields are not used in filtering or sorting, I doubt they are responsible for the long query duration.

Can you share an EXPLAIN?

@gyanprakash48
Author

gyanprakash48 commented Jul 3, 2024

In the beginning I thought that too, but trying different strategies to switch to a different explain plan was not helping at all. Then I commented out those JSON function columns in the SELECT, and suddenly the query was super fast. Even though they are not part of sorting and filtering, these JSON blobs are very big (when workflows are large, at least that's the case in my situation).

@gyanprakash48
Author

gyanprakash48 commented Jul 3, 2024

log (attached)
Maybe this recording will help further. I am sure this is only one of the scenarios (there must be many others).

I ran both queries twice (just to rule out concerns like caching).

@agilgur5
Member

agilgur5 commented Jul 6, 2024

Then I commented out those JSON function columns in the SELECT, and suddenly the query was super fast. Even though they are not part of sorting and filtering, these JSON blobs are very big (when workflows are large, at least that's the case in my situation).

Yea, they can be very large, though I'm surprised it takes that long since it's not part of the filter. I'm wondering if the engine is extracting the JSON multiple times, once for each field?
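
For anyone who wants to reproduce this observation, a rough comparison (a sketch, assuming Postgres and the table/column names from the query above) is to EXPLAIN ANALYZE the list query with and without the JSON-extracting columns:

```sql
-- Baseline: no JSON extraction in the select list.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, namespace, uid, phase, startedat, finishedat
FROM argo_archived_workflows
ORDER BY startedat DESC
LIMIT 50;

-- Same query plus a few JSON-extracted fields; the runtime difference is
-- roughly the cost of parsing the workflow blob for each extracted field.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name, namespace, uid, phase, startedat, finishedat,
       coalesce((workflow::json)->'metadata'->>'labels', '{}')      AS labels,
       coalesce((workflow::json)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::json)->'status'->>'message', '')         AS message
FROM argo_archived_workflows
ORDER BY startedat DESC
LIMIT 50;
```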

I tested it on the same data where the current query takes 9 seconds; the revised query completes in milliseconds. Does it have any downside?

	labels text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'labels') stored,
	annotations text GENERATED ALWAYS as ( (workflow::json)->'metadata'->>'annotations') stored,
	progress text GENERATED ALWAYS AS ((workflow::json)->'status'->>'progress') stored,
	creationtimestamp text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'creationTimestamp') stored,
	suspend text GENERATED ALWAYS AS ((workflow::json)->'spec'->>'suspend') stored,
	message text GENERATED ALWAYS AS ((workflow::json)->'status'->>'message') stored,
	estimatedduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'estimatedDuration') stored,
	resourcesduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'resourcesDuration') stored,

We don't currently filter on those, so they're a little unnecessary as individual columns, but otherwise that makes sense to me if it's compatible with MySQL as well. I'm wondering if we could just use a materialized view instead for a similar effect?
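
For illustration only, a sketch of the materialized-view idea (assuming Postgres; this is not the project's actual schema, and it would need an explicit refresh strategy since materialized views don't update automatically):

```sql
-- Hypothetical: pre-extract the listed fields so the list query never touches the blob.
CREATE MATERIALIZED VIEW argo_archived_workflows_list AS
SELECT clustername, uid, name, namespace, phase, startedat, finishedat,
       coalesce((workflow::json)->'metadata'->>'labels', '{}')          AS labels,
       coalesce((workflow::json)->'metadata'->>'annotations', '{}')     AS annotations,
       coalesce((workflow::json)->'status'->>'progress', '')            AS progress,
       coalesce((workflow::json)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::json)->'spec'->>'suspend'                             AS suspend,
       coalesce((workflow::json)->'status'->>'message', '')             AS message,
       coalesce((workflow::json)->'status'->>'estimatedDuration', '0')  AS estimatedduration,
       coalesce((workflow::json)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM argo_archived_workflows;

-- A unique index permits REFRESH MATERIALIZED VIEW CONCURRENTLY.
CREATE UNIQUE INDEX argo_archived_workflows_list_uid
    ON argo_archived_workflows_list (clustername, uid);
```

Note that MySQL has no native materialized views, so this would be Postgres-only; generated columns are the closer cross-database equivalent.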

cc @jiachengxu @terrytangyuan @jessesuen

@agilgur5 agilgur5 changed the title Improving Argo API performance Improving Argo API/DB performance Jul 6, 2024
@jlpgz

jlpgz commented Jul 8, 2024

We are experiencing the same problem using PostgreSQL. There are nearly 1 million archived workflows in the database, which are quite large, making the query extremely expensive to run. As a result, the UI becomes unusable due to its slow response time.

I have always wondered why the controller does not create a JSONB column to store the workflow information.

@agilgur5 agilgur5 added the P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important label Jul 9, 2024
@jlpgz

jlpgz commented Jul 9, 2024

Would it be possible to change the workflow column to JSONB in Postgres via migration? JSON operations would be faster and we could use GIN indexes.

@jiachengxu
Member

jiachengxu commented Jul 9, 2024

Would it be possible to change the workflow column to JSONB in Postgres via migration? JSON operations would be faster and we could use GIN indexes.

I think this is a good point; currently for Postgres we use json but not jsonb. In the archived workflow case (read-heavy), I think it makes sense to use JSONB since:

The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.

https://www.postgresql.org/docs/current/datatype-json.html

Yea they can be very large, though I'm surprised it takes that long, since it's not part of the filter. I'm wondering if the engine is extracting the JSON multiple times, once for each field?

@agilgur5 To your point, from the above quote ("The json data type stores an exact copy of the input text, which processing functions must reparse on each execution"), I think the current implementation is most likely reparsing the JSON for each field.

@agilgur5
Member

agilgur5 commented Jul 9, 2024

and we could use GIN indexes.

We don't need indexes on it currently as the JSON is not used in filtering.

I have always wondered why the controller does not create a JSONB column to store the workflow information.

Otherwise, it does sound like JSONB is the better choice for our read queries. This is only supported for Postgres though, and we'd have to add the min version of Postgres required for JSONB to the docs. I imagine lack of MySQL support and legacy code is the reason but 🤷

Is it straightforward to write a migration to JSONB?

We'd also still have to solve this for MySQL, so that alone wouldn't be enough. A materialized view or generated columns are probably needed, and compressed nodes (#13313) might help too.

I think for current implementation, most likely it is reparsing the json for each field.

That would explain why it may have gotten slower. It is odd to me that it's reparsed (the engine should be able to store it in memory and then reuse it for the query), but materialized views / generated columns should solve that relatively cleanly, assuming writes are not heavily slowed down (e.g. if it reparses each field during derivation/generation too).

@agilgur5 agilgur5 changed the title Improving Argo API/DB performance Improve Argo API/DB read performance Jul 9, 2024
@hmoulart

Hello,
I have the same behavior, with the database overloaded with archiveTTL: 180d.
Adding an index on some columns was not enough...
(screenshots attached)
The database is an AWS RDS PostgreSQL t4g.small.
archived_workflows: 816 757
argo_archived_workflows: 37 427
The number of rows is not that huge...

We created a job on the PostgreSQL database to keep 2 days in the active Argo table and 178 days in the archive table. The situation is now better.
But Argo hits the database very often for no reason with the query you can see in the screenshot (which is the same query as in this topic).
I'm available to help identify the issue and optimize Argo to avoid overloading databases.
Thanks for your support!

@jlpgz

jlpgz commented Jul 10, 2024

Is it straightforward to write a migration to JSONB?

I would say yes. My only concern is how long it might take to convert the workflow column for a large number of rows.

ALTER TABLE argo_archived_workflows
ALTER COLUMN workflow
SET DATA TYPE JSONB
USING workflow::JSONB;
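
To get a rough sense of how long that rewrite might take before committing to it (note the ALTER rewrites the whole table and holds an exclusive lock while it runs), one could time the cast on a sample first; a sketch:

```sql
-- Rough estimate: time the json -> jsonb conversion over a sample of rows,
-- then extrapolate to the full table size.
EXPLAIN ANALYZE
SELECT workflow::jsonb
FROM argo_archived_workflows
LIMIT 10000;
```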

@gyanprakash48
Author

Does changing the type to JSONB solve the performance issue completely? Has someone validated it?

@jlpgz

jlpgz commented Jul 31, 2024

But the query is still converting the workflow column to json, isn’t it?

@agilgur5
Member

agilgur5 commented Jul 31, 2024

Sorry for the confusion. The migration completed. So now the table is using jsonb for workflows.

After the migration, I reran the query (without the limit), and there was no change in performance

Ok thanks for clarifying. I imagine that without a limit any query is going to be slow. JSONB should still be faster than the JSON variant regardless though.

But the query is still converting the workflow column to json, isn’t it?

Is there a separate query for JSONB specifically? I actually don't know

My recommendation here is to just widen the table. It seems that, for whatever reason, postgres really doesn't like giant json blobs. If you splat out the queried fields, my assumption is it would be much faster.

Yea, the generated columns approach originally suggested by the OP and then by me above is more or less equivalent to that, and it was significantly faster (an order of magnitude or two) per the testing in the OP's comment.

@agilgur5 agilgur5 added P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority and removed P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important labels Sep 6, 2024
bom-d-van pushed a commit to bom-d-van/argo-workflows that referenced this issue Sep 6, 2024
The current implementation appears to cause Postgres to always unmarshal the workflow JSON payload for all the records in the table. By adopting a subquery approach, we are able to optimise the query from a runtime of 11495.734 ms to 44.713 ms. The data size is about 417481 argo_archived_workflows rows and 1794624 argo_archived_workflows_labels rows.

This new change is backward compatible and has been tested in our production env (using PostgreSQL).

Related issue: argoproj#13295

query change example:

previous:

```sql
SELECT name,
       namespace,
       UID,
       phase,
       startedat,
       finishedat,
       coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
       coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
       coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::JSON)->'spec'->>'suspend' AS suspend,
       coalesce((workflow::JSON)->'status'->>'message', '') AS message,
       coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
       coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE (("clustername" = 'default'
        AND "instanceid" = '')
       AND "namespace" = 'argo-map'
       AND EXISTS
         (SELECT 1
          FROM argo_archived_workflows_labels
          WHERE clustername = argo_archived_workflows.clustername
            AND UID = argo_archived_workflows.uid
            AND name = 'workflows.argoproj.io/phase'
            AND value = 'Succeeded')
       AND EXISTS
         (SELECT 1
          FROM argo_archived_workflows_labels
          WHERE clustername = argo_archived_workflows.clustername
            AND UID = argo_archived_workflows.uid
            AND name = 'workflows.argoproj.io/workflow-template'
            AND value = 'mapping1-pipeline-template-with-nfs'))
ORDER BY "startedat" DESC
LIMIT 1;
```

now:

```sql
SELECT name,
       namespace,
       UID,
       phase,
       startedat,
       finishedat,
       coalesce((workflow::JSON)->'metadata'->>'labels', '{}') AS labels,
       coalesce((workflow::JSON)->'metadata'->>'annotations', '{}') AS annotations,
       coalesce((workflow::JSON)->'status'->>'progress', '') AS progress,
       coalesce((workflow::JSON)->'metadata'->>'creationTimestamp', '') AS creationtimestamp,
       (workflow::JSON)->'spec'->>'suspend' AS suspend,
       coalesce((workflow::JSON)->'status'->>'message', '') AS message,
       coalesce((workflow::JSON)->'status'->>'estimatedDuration', '0') AS estimatedduration,
       coalesce((workflow::JSON)->'status'->>'resourcesDuration', '{}') AS resourcesduration
FROM "argo_archived_workflows"
WHERE "clustername" = 'default'
  AND UID IN
    (SELECT UID
     FROM "argo_archived_workflows"
     WHERE (("clustername" = 'default'
             AND "instanceid" = '')
            AND "namespace" = 'argo-map'
            AND EXISTS
              (SELECT 1
               FROM argo_archived_workflows_labels
               WHERE clustername = argo_archived_workflows.clustername
                 AND UID = argo_archived_workflows.uid
                 AND name = 'workflows.argoproj.io/phase'
                 AND value = 'Succeeded')
            AND EXISTS
              (SELECT 1
               FROM argo_archived_workflows_labels
               WHERE clustername = argo_archived_workflows.clustername
                 AND UID = argo_archived_workflows.uid
                 AND name = 'workflows.argoproj.io/workflow-template'
                 AND value = 'mapping1-pipeline-template-with-nfs'))
     ORDER BY "startedat" DESC
     LIMIT 1);
```
@agilgur5
Member

agilgur5 commented Sep 7, 2024

So I've been thinking about this after discussing it in the Aug 6th Contributor Meeting with @sarabala1979. All of the options we discussed to support this had trade-offs, so there's no optimal solution.

I was thinking about the behavior of the Archived Workflows API prior to #11121, which takes me back to my own comment, #11121 (comment). The fix for that comment was #12912, which added the JSON_EXTRACT statements instead of retrieving the entirety of the JSON, which had caused both slow queries and OOMs on the Server. Now it's just a slow query at least 😅

We could remove those fields entirely from the SELECT, i.e. never return them from the Archived Workflows List API, and therefore never need to extract the JSON or add new columns. This would reflect the behavior of 3.4, but would make for some confusing inconsistencies in the unified list in 3.5+, as some items in the list would have the fields but the archived ones would not.
This seems not to strongly affect users though, afaict, because #13098 is similar and did not get any upvotes.
The main problem with this approach is that it limits the types of filtering that can be done on archived workflows (e.g. annotations are extracted from the JSON), and we expected to be able to implement more (more have always been requested) after the SQLiteDB addition in #12736 / #13021. Creating new columns for every piece of metadata users want to filter on is not necessarily sustainable either, though.
So I think, if necessary, we can roll back to that behavior and then see what we can do in the future in 3.6+.

But perhaps #13566 has the root-cause fix, if all the errors are worked out, and we won't have to think about any more options.

@ryancurrah
Contributor

ryancurrah commented Sep 9, 2024

We could remove those fields entirely from the SELECT, i.e. never return them from the Archived Workflows List API, and therefore never need to extract the JSON or add new columns. This would reflect the behavior of 3.4, but would make for some confusing inconsistencies in the unified list in 3.5+, as some items in the list would have the fields but the archived ones would not.

Not sure if I fully understand this comment. We search archived workflows by labels frequently because we archive workflows almost immediately. Would that mean we could not do that anymore?

@agilgur5
Member

agilgur5 commented Sep 9, 2024

I did forget a piece in my initial comment: labels actually have their own table already, so it would only be other forms of information in the JSON blob that would not be filterable, like annotations, creationTimestamp, message, pieces of the status, etc.; whatever doesn't have an existing column.
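
For context, label conditions in the list query already go through that separate table rather than the JSON blob, along these lines (adapted from the queries earlier in this thread; the namespace and label value are placeholders):

```sql
-- Label filtering uses argo_archived_workflows_labels, so it does not touch the workflow blob.
SELECT name, namespace, uid, phase, startedat, finishedat
FROM argo_archived_workflows
WHERE clustername = 'default'
  AND namespace = 'argo'            -- placeholder namespace
  AND EXISTS (
      SELECT 1
      FROM argo_archived_workflows_labels
      WHERE clustername = argo_archived_workflows.clustername
        AND uid = argo_archived_workflows.uid
        AND name = 'workflows.argoproj.io/phase'  -- label key, stored as its own row
        AND value = 'Succeeded')                  -- placeholder value
ORDER BY startedat DESC
LIMIT 50;
```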

@agilgur5
Member

agilgur5 commented Sep 10, 2024

But perhaps #13566 has the root cause fix if all the errors are worked out and we won't have to think about any more options

For other folks here who haven't been following the PR, I pushed up agilgur5/argo-cli:fix-archive-list-query-optimization for testing.

Please try it out and comment here with your query time improvement results (before/after) and DB type and version (e.g. MySQL vXXX)! Number of rows (e.g. 45000) and resources given (e.g. 2 CPU, 4GB memory) would also be helpful for proportional estimation.
Or report the lack of improvement if it didn't work for you.

@agilgur5
Member

agilgur5 commented Sep 10, 2024

From @Danny5487401 in #13563 (comment)

@agilgur5 three methods I tested:

  1. A nice implementation using a subquery, per fix(api): optimise archived list query. Fixes #13295 #13566
SELECT
    name, namespace, uid, phase, startedat, finishedat,
    coalesce(JSON_EXTRACT(workflow,'$.metadata.labels'), '{}') as labels,
    coalesce(JSON_EXTRACT(workflow,'$.metadata.annotations'), '{}') as annotations,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.progress')), '') as progress,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.metadata.creationTimestamp')), '') as creationtimestamp,
    JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.spec.suspend')) as suspend,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.message')), '') as message,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.estimatedDuration')), '0') as estimatedduration,
    coalesce(JSON_EXTRACT(workflow,'$.status.resourcesDuration'), '{}') as resourcesduration
FROM `argo_archived_workflows`
WHERE `clustername` = 'default' AND uid IN
( SELECT * FROM (SELECT uid
FROM `argo_archived_workflows`
WHERE ((`clustername` = 'default' AND `instanceid` = '') AND `namespace` = 'gllue-web'
    AND not exists
    ( select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/controller-instanceid')
    AND exists
    ( select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/phase' and value in ('Error', 'Failed', 'Pending', 'Running', 'Succeeded'))
          )
ORDER BY `startedat` DESC
    LIMIT 20)as x);

It drops from 37s 768ms to 1s 981ms.

  2. force index (argo_archived_workflows_i4)
SELECT
    name, namespace, uid, phase, startedat, finishedat,
    coalesce(JSON_EXTRACT(workflow,'$.metadata.labels'), '{}') as labels,
    coalesce(JSON_EXTRACT(workflow,'$.metadata.annotations'), '{}') as annotations,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.progress')), '') as progress,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.metadata.creationTimestamp')), '') as creationtimestamp,
    JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.spec.suspend')) as suspend,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.message')), '') as message,
    coalesce(JSON_UNQUOTE(JSON_EXTRACT(workflow,'$.status.estimatedDuration')), '0') as estimatedduration,
    coalesce(JSON_EXTRACT(workflow,'$.status.resourcesDuration'), '{}') as resourcesduration
FROM `argo_archived_workflows` force index (argo_archived_workflows_i4)
WHERE ((`clustername` = 'default' AND `instanceid` = '') AND `namespace` = 'gllue-web' AND not exists
    ( select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/controller-instanceid')
           AND exists
    ( select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/phase' and value in ('Error', 'Failed', 'Pending', 'Running', 'Succeeded'))
    )
ORDER BY `startedat` DESC
LIMIT 20;

It drops from 37s 768ms to 138ms.

  3. A composite index on (clustername, startedat) to use a backward index scan instead of a filesort
CREATE INDEX argo_archived_workflows_i5 ON argo_archived_workflows (clustername, startedat);

It drops from 37s to 140ms.

It sounds like the force index and the new index are both faster than the subquery, at least on MySQL 8.0; I wonder if we should use those instead of the subquery.
force index is potentially not the most resilient, but the new index seems like a "why not?" to me.

@gyanprakash48
Author

gyanprakash48 commented Sep 11, 2024

For other folks here who haven't been following the PR, I pushed up agilgur5/argo-cli:fix-archive-list-query-optimization for testing.

@agilgur5, before I test, will there be any breaking changes (on the DB side)? Will I be able to roll back?

@agilgur5
Member

No breaking changes; that's what makes it a viable choice for a patch. It's just a change to the list query itself. You can roll back without issue. But of course I would try it in staging first as a regular precaution.

If you have DB access, you can also test by running the query yourself; it's no. 1 above for MySQL, and #13566's opening comment has the Postgres variant.

@ryancurrah
Contributor

@agilgur5 the image name is argo-cli; can I use it to replace argo-server and argo-executor?

@agilgur5
Member

No, you only need to replace the Server, since this query happens on the Server

@ryancurrah
Contributor

I can confirm this change reduces query times. Loading time in the UI when filtering by labels was reduced by 94.44%.

Query time before: 18 seconds
Query time after: 1 second
DB type: PostgreSQL
Version: 14
Number of rows (SELECT COUNT(*) FROM public.argo_archived_workflows;): 93135
CPUs: 24
Memory: 16GB

@agilgur5
Member

agilgur5 commented Sep 14, 2024

It sounds like the force index and the new index are both faster than the subquery, at least on MySQL 8.0; I wonder if we should use those instead of the subquery.
force index is potentially not the most resilient, but the new index seems like a "why not?" to me.

Added a new issue for further optimizations to this query in #13601

@gyanprakash48
Author

gyanprakash48 commented Oct 4, 2024

@agilgur5, sorry for the delay on my side in testing, but even with agilgur5/argo-cli:fix-archive-list-query-optimization I am not finding any improvement in UI performance. I am using PostgreSQL version 13.15.

@agilgur5
Member

agilgur5 commented Oct 4, 2024

sorry for the delay on my side in testing, but even with agilgur5/argo-cli:fix-archive-list-query-optimization

It's already released in 3.5.11.

I am not finding any improvement in UI performance

But you'd really have to detail why you're not seeing any improvement in API performance when other people are. Did you run the image for your Server properly? What kinds of queries are you using against the API and seeing in the DB? etc.

Per #13601 (comment), unpaginated queries can still be slow (indexing doesn't help that of course)

@gyanprakash48
Author

gyanprakash48 commented Oct 15, 2024

@agilgur5, yes, I applied the image (agilgur5/argo-cli:fix-archive-list-query-optimization) to the Server, and then for testing I simply used the UI. The UI was still loading very slowly (it's paginated, with the default limit of 50). The only difference in my use case might be that my workflows are very large (1000+ tasks), leading to very large JSON blobs.
