Use shard split copy code for blocking shard moves #6098

Merged
merged 1 commit into main from copy-shard-move-new-api on Aug 1, 2022

Conversation

@JelteF (Contributor) commented on Jul 28, 2022

DESCRIPTION: Improve performance of blocking shard moves

The new shard copy code that was created for shard splits has some
advantages over the old shard copy code. The old code used
worker_append_table_to_shard, which wrote to disk twice, and it did not
use binary copy even when that was possible. Both of these issues are
fixed in the new copy code. This PR starts using this new copy logic for
shard moves as well, not just for shard splits.

On my local machine I created a single shard table like this.

```sql
set citus.shard_count = 1;
create table t(id bigint, a bigint);
select create_distributed_table('t', 'id');

INSERT into t(id, a) SELECT i, i from generate_series(1, 100000000) i;
```

I then turned `fsync` off to make sure I wasn't bottlenecked by disk.
Finally I moved this shard between nodes with `citus_move_shard_placement`
with `block_writes`.

Before this PR a move took ~127s; after this PR it took only ~38s. So for this
small test the move spent ~70% less time.

I also tried the same test with a table containing large strings:

```sql
set citus.shard_count = 1;
create table t(id bigint, a bigint, content text);
select create_distributed_table('t', 'id');

INSERT into t(id, a, content) SELECT i, i, 'aunethautnehoautnheaotnuhetnohueoutnehotnuhetncouhaeohuaeochgrhgd.athbetndairgexdbuhaobulrhdbaetoausnetohuracehousncaoehuesousnaceohuenacouhancoexdaseohusnaetobuetnoduhasneouhaceohusnaoetcuhmsnaetohuacoeuhebtokteaoshetouhsanetouhaoug.lcuahesonuthaseauhcoerhuaoecuh.lg;rcydabsnetabuesabhenth' from generate_series(1, 20000000) i;
```

The result was less impressive there, but still quite good:
~60s before versus ~37s after, i.e. ~38% less time.

@JelteF force-pushed the copy-shard-move-new-api branch 6 times, most recently from b2afc83 to 88e2d90 on July 28, 2022 15:34
src/test/regress/sql/ignoring_orphaned_shards.sql (review thread, outdated, resolved)
```c
 * the shard, this is all fine so we temporarily allow it.
 */
ExecuteCriticalRemoteCommand(connection,
                             "SET LOCAL citus.allow_nested_distributed_execution = true");
```
Member:

In the rebalancer we do this by setting the application_name:

```c
StringInfo setApplicationName = makeStringInfo();
appendStringInfo(setApplicationName, "SET application_name TO %s",
                 CITUS_REBALANCER_NAME);
```

Wondering whether we should do something similar here (maybe generalizing the application_name)?

JelteF (author):

I changed this to use the same application_name trick, but I don't think it's worth spending time making this more generic, since the rebalancer won't run commands like this for much longer anyway, because of Nils' background daemon changes.
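(A rough sketch of what the session setup on the remote copy connection would then look like; the exact application_name value is an assumption, not taken from this PR.)

```sql
-- Hypothetical: mark this connection as an internal Citus operation via
-- application_name instead of SET LOCAL citus.allow_nested_distributed_execution.
-- The value citus_rebalancer is assumed from the CITUS_REBALANCER_NAME constant above.
SET application_name TO citus_rebalancer;
```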

@JelteF force-pushed the copy-shard-move-new-api branch 6 times, most recently from 30c16c4 to 7bed347 on July 29, 2022 14:17
src/backend/distributed/operations/worker_copy_udf.c (review thread, outdated, resolved)
```c
Oid relationId = PG_GETARG_OID(0);
uint32_t targetNodeId = PG_GETARG_INT32(1);

if (IsCitusTable(relationId))
```
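(For context, a hedged sketch of how this UDF is invoked from SQL, assuming its SQL-level signature is `(regclass, integer)` as the C arguments above suggest; the shard relation name and node id are placeholders.)

```sql
-- Hypothetical invocation on the source worker; 't_102008' and target node id 2
-- are placeholders, not values taken from this PR.
SELECT worker_copy_table_to_node('t_102008'::regclass, 2);
```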
Member:

Do we need this check? We're reading from the pg_dist_partition metadata, which we determined might not be ideal.

JelteF (author):

After checking, I guess we don't need it. Even without it we will get this error:

```
+ERROR: cannot execute a distributed query from a query on a shard
+DETAIL: Executing a distributed query in a function call that may be pushed to a remote node can lead to incorrect results.
```

Contributor:

@marcocitus, @JelteF, by any chance, do you remember why reading from the pg_dist_partition metadata "might not be ideal" and why you relied on `ERROR: cannot execute a distributed query from a query on a shard`?
#6795 shows that in some cases that error may not occur.
Maybe you should reconsider this decision and check for Citus tables in worker_copy_table_to_node?

Collaborator:

I'm guessing because this might happen in the context of a transaction that modifies pg_dist_partition, and if those happen over a different connection then they might not be visible here.

Contributor:

I'm not sure I understand what you mean. Could you be more specific about what problem might be caused by this check?

Based on your comment, I can only come up with the following scenario:
1. There is a Citus table.
2. On connection A this table is converted to a regular table.
3. On connection B we call worker_copy_table_to_node for this table, and it fails with an error because it doesn't see the changes and still sees the table as a Citus table.

In this case, why can't we wait until the changes become visible and retry worker_copy_table_to_node? It seems to me that as long as we see the table as a Citus table, we shouldn't call worker_copy_table_to_node for it.

Collaborator:

> In this case why can't we wait until the changes become visible

Because changes only become visible when the coordinator commits and finalizes the 2PC, and that won't happen until it's done waiting for worker_copy_table_to_node, because committing is always the last thing the coordinator does.

src/backend/distributed/operations/worker_copy_udf.c (review thread, outdated, resolved)
@JelteF force-pushed the copy-shard-move-new-api branch 5 times, most recently from 4e8b63d to 1922ddf on August 1, 2022 11:25
The new shard copy code that was created for shard splits has some
advantages over the old code. This one uses binary copy when possible to
make the copy faster. When doing a shard move using `block_writes` it
now uses this better copy logic.
@JelteF enabled auto-merge (squash) on August 1, 2022 17:02
@JelteF merged commit abffa6c into main on Aug 1, 2022
@JelteF deleted the copy-shard-move-new-api branch on August 1, 2022 17:10
yxu2162 pushed a commit that referenced this pull request Sep 15, 2022