Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for postgres-xl table distribution #1697

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

Jmorjsm
Copy link

@Jmorjsm Jmorjsm commented Feb 13, 2021

This adds support for specifying a distribution strategy (docs) when creating tables against postgres-xl.

[ 
  DISTRIBUTE BY { REPLICATION | ROUNDROBIN | { [HASH | MODULO ] ( column_name ) } } |
  DISTRIBUTED { { BY ( column_name ) } | { RANDOMLY } |
  DISTSTYLE { EVEN | KEY | ALL } DISTKEY ( column_name )
]

Copy link
Member

@roji roji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jmorjsm thanks, that looks like a good start!

Here's a first round of review comments.

  • I wouldn't worry too much about PostgreSQL-XL vs. Greenplum - the two syntaxes definitely look different enough to warrant separate APIs, even if one partially subsumes the other.
  • Isn't there a missing closing curly brace in the PostgreSQL-XL spec, i.e. the one that's opened after DISTRIBUTED? I have a hard time understand which options exactly are mutually exclusive here.
  • Somewhat related: the PostgreSQL-XL syntax (which I know nothing about!) seems to distinguish between DISTRIBUTE BY and DISTRIBUTED BY; the former for REPLICATION/ROUNDROBIN/<column> with/without function, the latter for RANDOMLY/<column>. Does that mean there are really two ways to define things with a column (one with DISTRIBUTE BY, one with DISTRIBUTED BY)? I don't know if any of this is significant, but ideally the builder's FluentAPI would match this the DDL syntax (so separate DistributeBy and DistributedBy)
  • In theory, we should validate that there aren't two entities mapped to the same table ("table splitting") with different DISTRIBUTE BY configurations. This is pretty far-fetched, but if you want to do it, take a look at what I did in SqlServerModelValidator in https://github.com/dotnet/efcore/pull/23904/files.

src/EFCore.PG/Metadata/PostgresXlDistributeBy.cs Outdated Show resolved Hide resolved
src/EFCore.PG/Metadata/PostgresXlDistributeBy.cs Outdated Show resolved Hide resolved
src/EFCore.PG/Metadata/PostgresXlDistributeBy.cs Outdated Show resolved Hide resolved
src/EFCore.PG/Metadata/PostgresXlDistributeByStrategy.cs Outdated Show resolved Hide resolved
{
var distributeBy = new PostgresXlDistributeBy(operation);

var strategy = distributeBy.DistributionStrategy;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can define a Deconstruct method on PostgresXlDistributeBy and use it in one line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to the standard ValueTuple Deconstruct method as described here.

src/EFCore.PG/Migrations/NpgsqlMigrationsSqlGenerator.cs Outdated Show resolved Hide resolved
src/EFCore.PG/Migrations/NpgsqlMigrationsSqlGenerator.cs Outdated Show resolved Hide resolved
src/EFCore.PG/Migrations/NpgsqlMigrationsSqlGenerator.cs Outdated Show resolved Hide resolved
@Jmorjsm
Copy link
Author

Jmorjsm commented Feb 28, 2021

Thanks for your feedback here!
I did some further testing against a local postgres-xl instance and saw that the following queries were possible:

CREATE TABLE distribute_test.test_distribute_by_replication (id int, first_name varchar(100), last_name varchar(100), age int) DISTRIBUTE BY REPLICATION ;
CREATE TABLE distribute_test.test_distribute_by_roundrobin (id int, first_name varchar(100), last_name varchar(100), age int) DISTRIBUTE BY ROUNDROBIN ;
CREATE TABLE distribute_test.test_distribute_by_hash_id (id int, first_name varchar(100), last_name varchar(100), age int) DISTRIBUTE BY HASH (id);
CREATE TABLE distribute_test.test_distribute_by_modulo_id (id int, first_name varchar(100), last_name varchar(100), age int) DISTRIBUTE BY MODULO (id);

-- DOES NOT WORK: CREATE TABLE distribute_test.test_distribute_by_id (id int, first_name varchar(100), last_name varchar(100), age int) DISTRIBUTE BY (id);

CREATE TABLE distribute_test.test_distributed_by_id (id int, first_name varchar(100), last_name varchar(100), age int) DISTRIBUTED BY (id);
CREATE TABLE distribute_test.test_distributed_randomly (id int, first_name varchar(100), last_name varchar(100), age int) DISTRIBUTED RANDOMLY;

CREATE TABLE distribute_test.test_diststyle_even (id int, first_name varchar(100), last_name varchar(100), age int) DISTSTYLE EVEN;
CREATE TABLE distribute_test.test_diststyle_all (id int, first_name varchar(100), last_name varchar(100), age int) DISTSTYLE ALL;
CREATE TABLE distribute_test.test_diststyle_key_distkey_id (id int, first_name varchar(100), last_name varchar(100), age int) DISTSTYLE KEY DISTKEY (id);

Looks like DISTRIBUTED BY ({columnname}) is valid, while DISTRIBUTE BY ({columnname}) is invalid, as well as DISTSTYLE EVEN or ALL also being invalid with a distribution key.

@@ -1821,6 +1884,118 @@ public IndexColumn(string name, string @operator, string collation, SortOrder so
public NullSortOrder NullSortOrder { get; }
}

private static void ValidateTableDistributionProperties(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a local method at the end of Generate(CreateTableOperation), as it's not called by anyone else.

Also, distributedByColumnName doesn't actually seem to be check anywhere.

@roji
Copy link
Member

roji commented Mar 5, 2021

I've answered some questions, let me know if you're blocking on anything from my side (note there's still also my questions in #1697 (review) which may need addressing).

@roji
Copy link
Member

roji commented Mar 5, 2021

Looks like DISTRIBUTED BY ({columnname}) is valid, while DISTRIBUTE BY ({columnname}) is invalid, as well as DISTSTYLE EVEN or ALL also being invalid with a distribution key.

Yeah, it seems... complicated :) But I think the fluent APIs we expose should correspond to what is actually supported on the database side.

@roji
Copy link
Member

roji commented Apr 7, 2021

@Jmorjsm just to say that when this is ready for another look, please re-request a review (preferably after this passes the build too).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants